{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Swinging up the pendulum using DDPG\n",
    "\n",
    "In this section, let's implement the DDPG algorithm to train the agent for swinging up the\n",
    "pendulum. That is, we will have a pendulum which starts swinging from a random\n",
    "position and the goal of our agent is to swing the pendulum up so it stays upright.\n",
    "\n",
    "\n",
    "First, let's import the required libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.0.0\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import numpy as np\n",
    "import gym"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a clear understanding of how the DDPG method works, we use\n",
    "TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow.compat.v1 as tf\n",
    "tf.disable_v2_behavior()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the gym environment\n",
    "\n",
    "Let's create a pendulum environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make(\"Pendulum-v0\").unwrapped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the state shape of the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_shape = env.observation_space.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the action shape of the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_shape = env.action_space.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the pendulum is a continuous environment and thus our action space consists of\n",
    "continuous values. So, we get the bound of our action space:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_bound = [env.action_space.low, env.action_space.high]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the variables\n",
    "\n",
    "Now, let's define some of the important variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the discount factor, $\\gamma$: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "gamma = 0.9 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the value of $\\tau$ which is used for soft replacement:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "tau = 0.01 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the size of our replay buffer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "replay_buffer = 10000 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the batch size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 32  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Defining the DDPG class\n",
    "\n",
    "Let's define the class called DDPG where we will implement the DDPG algorithm. For a clear understanding, you can also check the detailed explanation of code on the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "class DDPG(object):\n",
    "\n",
    "    #first, let's define the init method\n",
    "    def __init__(self, state_shape, action_shape, high_action_value,):\n",
    "        \n",
    "        #define the replay buffer for storing the transitions\n",
    "        self.replay_buffer = np.zeros((replay_buffer, state_shape * 2 + action_shape + 1), dtype=np.float32)\n",
    "    \n",
    "        #initialize the num_transitionsto 0 which implies that the number of transitions in our\n",
    "        #replay buffer is zero\n",
    "        self.num_transitions = 0\n",
    "            \n",
    "        #start the TensorFlow session\n",
    "        self.sess = tf.Session()\n",
    "        \n",
    "        #we learned that in DDPG, instead of selecting the action directly, to ensure exploration,\n",
    "        #we add some noise using the Ornstein-Uhlenbeck process. So, we first initialize the noise\n",
    "        self.noise = 3.0\n",
    "        \n",
    "        #initialize the state shape, action shape, and high action value\n",
    "        self.state_shape, self.action_shape, self.high_action_value = state_shape, action_shape, high_action_value\n",
    "        \n",
    "        #define the placeholder for the state\n",
    "        self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\n",
    "        \n",
    "        #define the placeholder for the next state\n",
    "        self.next_state = tf.placeholder(tf.float32, [None, state_shape], 'next_state')\n",
    "        \n",
    "        #define the placeholder for reward\n",
    "        self.reward = tf.placeholder(tf.float32, [None, 1], 'reward')\n",
    "        \n",
    "        #with the actor variable scope\n",
    "        with tf.variable_scope('Actor'):\n",
    "\n",
    "            #define the main actor network which is parameterized by phi. Actor network takes the state\n",
    "            #as an input and returns the action to be performed in that state\n",
    "            self.actor = self.build_actor_network(self.state, scope='main', trainable=True)\n",
    "            \n",
    "            #Define the target actor network which is parameterized by phi dash. Target actor network takes\n",
    "            #the next state as an input and returns the action to be performed in that state\n",
    "            target_actor = self.build_actor_network(self.next_state, scope='target', trainable=False)\n",
    "            \n",
    "        #with the critic variable scope\n",
    "        with tf.variable_scope('Critic'):\n",
    "            \n",
    "            #define the main critic network which is parameterized by theta. Critic network takes the state\n",
    "            #and also the action produced by the actor in that state as an input and returns the Q value\n",
    "            critic = self.build_critic_network(self.state, self.actor, scope='main', trainable=True)\n",
    "            \n",
    "            #Define the target critic network which is parameterized by theta dash. Target critic network takes\n",
    "            #the next state and also the action produced by the target actor network in the next state as\n",
    "            #an input and returns the Q value\n",
    "            target_critic = self.build_critic_network(self.next_state, target_actor, scope='target', \n",
    "                                                      trainable=False)\n",
    "            \n",
    "        \n",
    "        #get the parameter of the main actor network, phi\n",
    "        self.main_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/main')\n",
    "        \n",
    "        #get the parameter of the target actor network, phi dash\n",
    "        self.target_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')\n",
    "    \n",
    "        #get the parameter of the main critic network, theta\n",
    "        self.main_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/main')\n",
    "        \n",
    "        #get the parameter of the target critic networ, theta dash\n",
    "        self.target_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')\n",
    "\n",
    "        #perform the soft replacement and update the parameter of the target actor network and\n",
    "        #the parameter of the target critic network\n",
    "        self.soft_replacement = [\n",
    "\n",
    "            [tf.assign(phi_, tau*phi + (1-tau)*phi_), tf.assign(theta_, tau*theta + (1-tau)*theta_)]\n",
    "            for phi, phi_, theta, theta_ in zip(self.main_actor_params, self.target_actor_params, self.main_critic_params, self.target_critic_params)\n",
    "\n",
    "            ]\n",
    "        \n",
    "        #compute the target Q value, we learned that the target Q value can be computed as the\n",
    "        #sum of reward and discounted Q value of next state-action pair\n",
    "        y = self.reward + gamma * target_critic\n",
    "        \n",
    "        #now, let's compute the loss of the critic network. The loss of the critic network is the mean\n",
    "        #squared error between the target Q value and the predicted Q value\n",
    "        MSE = tf.losses.mean_squared_error(labels=y, predictions=critic)\n",
    "        \n",
    "        #train the critic network by minimizing the mean squared error using Adam optimizer\n",
    "        self.train_critic = tf.train.AdamOptimizer(0.01).minimize(MSE, name=\"adam-ink\", var_list = self.main_critic_params)\n",
    "        \n",
    "        \n",
    "        #We learned that the objective function of the actor is to generate an action that maximizes\n",
    "        #the Q value produced by the critic network. We can maximize the above objective by computing gradients \n",
    "        #and by performing gradient ascent. However, it is a standard convention to perform minimization rather \n",
    "        #than maximization. So, we can convert the above maximization objective into the minimization\n",
    "        #objective by just adding a negative sign.\n",
    "        \n",
    "        \n",
    "        #now we can minimize the actor network objective by computing gradients and by performing gradient descent\n",
    "        actor_loss = -tf.reduce_mean(critic)    \n",
    "           \n",
    "        #train the actor network by minimizing the loss using Adam optimizer\n",
    "        self.train_actor = tf.train.AdamOptimizer(0.001).minimize(actor_loss, var_list=self.main_actor_params)\n",
    "            \n",
    "        #initialize all the TensorFlow variables:\n",
    "        self.sess.run(tf.global_variables_initializer())\n",
    "\n",
    "       \n",
    "    #let's define a function called select_action for selecting the action with the noise to ensure exploration\n",
    "    def select_action(self, state):\n",
    "        \n",
    "        #run the actor network and get the action\n",
    "        action = self.sess.run(self.actor, {self.state: state[np.newaxis, :]})[0]\n",
    "        \n",
    "        #now, we generate a normal distribution with mean as action and standard deviation as the\n",
    "        #noise and we randomly select an action from this normal distribution\n",
    "        action = np.random.normal(action, self.noise)\n",
    "        \n",
    "        #we need to make sure that our action should not fall away from the action bound. So, we\n",
    "        #clip the action so that they lie within the action bound and then we return the action\n",
    "        action =  np.clip(action, action_bound[0],action_bound[1])\n",
    "        \n",
    "        return action\n",
    "        \n",
    "    #now, let's define the train function\n",
    "    def train(self):\n",
    "        \n",
    "        #perform the soft replacement\n",
    "        self.sess.run(self.soft_replacement)\n",
    "        \n",
    "        #randomly select indices from the replay buffer with the given batch size\n",
    "        indices = np.random.choice(replay_buffer, size=batch_size)\n",
    "        \n",
    "        #select the batch of transitions from the replay buffer with the selected indices\n",
    "        batch_transition = self.replay_buffer[indices, :]\n",
    "\n",
    "        #get the batch of states, actions, rewards, and next states\n",
    "        batch_states = batch_transition[:, :self.state_shape]\n",
    "        batch_actions = batch_transition[:, self.state_shape: self.state_shape + self.action_shape]\n",
    "        batch_rewards = batch_transition[:, -self.state_shape - 1: -self.state_shape]\n",
    "        batch_next_state = batch_transition[:, -self.state_shape:]\n",
    "\n",
    "        #train the actor network\n",
    "        self.sess.run(self.train_actor, {self.state: batch_states})\n",
    "        \n",
    "        #train the critic network\n",
    "        self.sess.run(self.train_critic, {self.state: batch_states, self.actor: batch_actions,\n",
    "                                          self.reward: batch_rewards, self.next_state: batch_next_state})\n",
    "        \n",
    "\n",
    "    #now, let's store the transitions in the replay buffer\n",
    "    def store_transition(self, state, actor, reward, next_state):\n",
    "\n",
    "        #first stack the state, action, reward, and next state\n",
    "        trans = np.hstack((state,actor,[reward],next_state))\n",
    "        \n",
    "        #get the index\n",
    "        index = self.num_transitions % replay_buffer\n",
    "        \n",
    "        #store the transition\n",
    "        self.replay_buffer[index, :] = trans\n",
    "        \n",
    "        #update the number of transitions\n",
    "        self.num_transitions += 1\n",
    "        \n",
    "        #if the number of transitions is greater than the replay buffer then train the network\n",
    "        if self.num_transitions > replay_buffer:\n",
    "            self.noise *= 0.99995\n",
    "            self.train()\n",
    "            \n",
    "\n",
    "    def build_actor_network(self, state, scope, trainable):\n",
    "        \n",
    "        #we define a function called build_actor_network for building the actor network. The\n",
    "        #actor network takes the state and returns the action to be performed in that state\n",
    "        with tf.variable_scope(scope):\n",
    "            layer_1 = tf.layers.dense(state, 30, activation = tf.nn.tanh, name = 'layer_1', trainable = trainable)\n",
    "            actor = tf.layers.dense(layer_1, self.action_shape, activation = tf.nn.tanh, name = 'actor', trainable = trainable)     \n",
    "            return tf.multiply(actor, self.high_action_value, name = \"scaled_a\")  \n",
    "\n",
    "\n",
    "    def build_critic_network(self, state, actor, scope, trainable):\n",
    "        \n",
    "        #we define a function called build_critic_network for building the critic network. The\n",
    "        #critic network takes the state and the action produced by the actor in that state and returns the Q value\n",
    "        with tf.variable_scope(scope):\n",
    "            w1_s = tf.get_variable('w1_s', [self.state_shape, 30], trainable = trainable)\n",
    "            w1_a = tf.get_variable('w1_a', [self.action_shape, 30], trainable = trainable)\n",
    "            b1 = tf.get_variable('b1', [1, 30], trainable = trainable)\n",
    "            net = tf.nn.tanh( tf.matmul(state, w1_s) + tf.matmul(actor, w1_a) + b1 )\n",
    "\n",
    "            critic = tf.layers.dense(net, 1, trainable = trainable)\n",
    "            return critic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the network\n",
    "\n",
    "Now, let's start training the network. First, let's create an object to our DDPG class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ddpg = DDPG(state_shape, action_shape, action_bound[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of episodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 300"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of time steps in each episode:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_timesteps = 500 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#for each episode\n",
    "for i in range(num_episodes):\n",
    "    \n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #initialize the return\n",
    "    Return = 0\n",
    "    \n",
    "    #for every step\n",
    "    for j in range(num_timesteps):\n",
    "        \n",
    "        #render the environment\n",
    "        env.render()\n",
    "\n",
    "        #select the action\n",
    "        action = ddpg.select_action(state)\n",
    "        \n",
    "        #perform the selected action\n",
    "        next_state, reward, done, info = env.step(action)\n",
    "        \n",
    "        #store the transition in the replay buffer\n",
    "        ddpg.store_transition(state, action, reward, next_state)\n",
    "      \n",
    "        #update the return\n",
    "        Return += reward\n",
    "    \n",
    "        #if the state is the terminal state then break\n",
    "        if done:\n",
    "            break\n",
    "    \n",
    "        #update the state to the next state\n",
    "        state = next_state\n",
    "\n",
    "    #print the return for every 10 episodes\n",
    "    if i %10 ==0:\n",
    "         print(\"Episode:{}, Return: {}\".format(i,Return))  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we learned how DDPG works and how to implement DDPG, in the next section,\n",
    "we will learn another interesting algorithm called twin delayed DDPG. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
